This is a natural language analysis of matching soccer team names, carried out as part of my research on Betting Strategy and Model Validation. The topic is for the final course, Data Science Capstone, on Coursera (JHU, Johns Hopkins University), which I have failed a few times and will retake in October 2015 (next month).

Note that the echo = FALSE and include = FALSE parameters were added to the code chunks below to prevent printing of the R code that generated the plots and tables. However, feel free to view the source code in Natural Language Analysis.Rmd.

1. Setup Options, Loading Required Libraries and Preparing Environment

Set up the knitr options and load the required libraries.

Create a parallel computing cluster and the support functions.

2. Read and Process the Dataset

Read the dataset of worldwide soccer matches from 2011 to 2015, obtained from a British betting consultancy referred to here as firm A.

table 2.1 48744 x 17

Because the dataset is very large (48744 x 17), the web page kept loading and could not be opened. Here I display only a few rows of the data frame.

Read the dataset of worldwide soccer matches from 2011 to 2015, scraped from the spbo livescore website.

table 2.2 488929 x 20

Because the dataset is very large (488929 x 20), the web page kept loading and could not be opened. Here I display only a few rows of the data frame.

3. Matching the team names

3.1 Matching Duplicated Team Names

To match strings, we can first apply match() or %in% to the team names. Strings that differ only in capitalization are not duplicates in R, so I apply tolower() before matching, since such names refer to exactly the same team in real life.

table 3.1.1
| team | spbo | pass |
|---|---|---|
| 3 de Febrero | 3 de Febrero | Duplicated |
| Aachen | Aachen | Duplicated |
| Aalesund | Aalesund | Duplicated |
| 12 de Octubre | 12 De Octubre | Capital Letters |
| Argentinos Juniors | Argentinos juniors | Capital Letters |
| EsPa | ESPA | Capital Letters |

table 3.1.1 1190 x 3
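As a minimal sketch of this step (the vectors `team` and `spbo` are small illustrative samples taken from table 3.1.1, not the full data), exact and case-insensitive matching in base R could look like this:

```r
# Illustrative samples only; the real data has 1190 matched rows.
team <- c("3 de Febrero", "12 de Octubre", "Argentinos Juniors", "EsPa")
spbo <- c("3 de Febrero", "12 De Octubre", "Argentinos juniors", "ESPA")

# Exact match: identical strings, including capitalization.
exact <- team %in% spbo

# Case-insensitive match: compare the lower-cased names.
no_case <- tolower(team) %in% tolower(spbo)

# Label each team the same way as the "pass" column in table 3.1.1.
pass <- ifelse(exact, "Duplicated",
               ifelse(no_case, "Capital Letters", NA))

# Look up the matching spbo name through the lower-cased strings.
result <- data.frame(team = team,
                     spbo = spbo[match(tolower(team), tolower(spbo))],
                     pass = pass,
                     stringsAsFactors = FALSE)
result
```

Only names that fail both checks (pass is NA) need to go on to the approximate matching steps below.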

3.2 Apply amatch() and stringdist()

One concern is that a second team's name is normally exactly the same as the first team's name, with only a suffix such as II or reserve added. For example, Mainz 05 is the first team, not the fifth reserve team. The more soccer match data is scraped, the more accurate the matching becomes: if we scraped only one day of data, how could we match the first team if, say, only Chelsea's reserve team played on that particular date?

However, another concern is that the first team is named TSV 1860 Munchen while the second and U19 teams are named 1860 Munchen II, 1860 Munchen U19 and so on. Likewise, the team name Lincoln is supposed to be matched with Lincoln City and not Lincoln United, yet by string distance Lincoln City is a closer approximate match to Lincoln Xxitxx than to Lincoln.
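The Lincoln case can be checked with a plain Levenshtein distance. This sketch uses base R's adist() rather than the stringdist package, and the names are the ones quoted in the text:

```r
# Levenshtein distance from "Lincoln City" to the two candidate names.
d <- adist("Lincoln City", c("Lincoln", "Lincoln Xxitxx"))
colnames(d) <- c("Lincoln", "Lincoln Xxitxx")
d
# "Lincoln" needs 5 deletions (" City"), while "Lincoln Xxitxx" needs only
# 4 edits, so a pure distance measure prefers the wrong name.
```

This is why a distance threshold alone cannot decide the match, and some knowledge of how team names are built is needed as well.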

Besides, if I prioritize matching on the kick-off date first and the team names second, postponed staked matches become a concern. A match may be postponed after firm A has placed its bets (firm A sometimes places bets on the early market), or the kick-off date may be changed or postponed before kick-off due to snow, heavy rain and so on.

I load the stringdist package and apply its approximate matching function amatch() to the team names. The available distance methods are:

    1. osa - Optimal string alignment (restricted Damerau-Levenshtein distance).
    2. lv - Levenshtein distance (as in R’s native adist).
    3. dl - Full Damerau-Levenshtein distance.
    4. hamming - Hamming distance (a and b must have the same number of characters).
    5. lcs - Longest common substring distance.
    6. qgram - q-gram distance.
    7. cosine - cosine distance between q-gram profiles.
    8. jaccard - Jaccard distance between q-gram profiles.
    9. jw - Jaro, or Jaro-Winkler distance.
    10. soundex - Distance based on soundex encoding.
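To get a feel for how differently these methods score the same pair of names, here is a small sketch. It assumes the stringdist package is installed, and the pair "Lincoln" / "Lincoln City" is just an illustrative choice:

```r
library(stringdist)

methods <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
             "cosine", "jaccard", "jw", "soundex")

# Distance between one pair of names under each of the ten methods.
dists <- sapply(methods, function(m) {
  stringdist("Lincoln", "Lincoln City", method = m)
})
dists
```

The edit-based methods (osa, lv, dl) count the five extra characters, hamming returns Inf because the two strings differ in length, and cosine, jaccard and jw return values between 0 and 1. The scales are not comparable, so a single maxDist value means something different for each method.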

Let's take an example.

[1] “Lincoln City”

table 3.2.1 10 x 12

I simply match the key word Lincoln against the Home and Away team names in the data obtained from firm A.

table 3.2.2 10 x 12

For the two tables above, I applied stringdist with maxDist set to the default value 0.1 as well as 0.5, 1.0, 2.0 and Inf, using all the available methods (the 10 methods listed earlier in this section). I do not pretend to know how the stringdist() algorithms match the strings internally. Therefore I tried both the unique team names and all elements (without filtering for uniqueness).
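The effect of maxDist can be seen on a tiny example. This is only a sketch: it assumes the stringdist package is installed, the candidate vector `spbo` is a made-up sample, and only the osa method is shown:

```r
library(stringdist)

# Made-up candidate names from the livescore side.
spbo <- c("Lincoln City", "Lincoln United", "Lincoln Red Imps")

max_dists <- c(0.1, 0.5, 1, 2, Inf)

# amatch() returns the index of the closest candidate within maxDist,
# or NA when no candidate is close enough.
idx <- sapply(max_dists, function(d) {
  amatch("Lincoln", spbo, method = "osa", maxDist = d)
})

data.frame(maxDist = max_dists, match = spbo[idx])
```

With the edit-based osa method, "Lincoln" is 5 edits away from "Lincoln City", so every finite maxDist tried here returns NA and only Inf returns a match. For the methods scaled between 0 and 1 (cosine, jaccard, jw) the same maxDist values behave very differently.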

3.3 Apply agrep()

I also tried applying the agrep() function to partially match the team names.

table 3.3.1
| Matching1 | team1 | spbo1 | Matching2 | team2 | spbo2 |
|---|---|---|---|---|---|
| Lincoln | Lincoln City | Lincoln | Lincoln City | Lincoln City | NA |
| Lincoln | NA | Lincoln (MO) | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln (Pa.) | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln Red Imps | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln Reserve | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln United | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln Women | Lincoln City | NA | NA |
| Lincoln | NA | Rivadavia Lincoln | Lincoln City | NA | NA |

table 3.3.1 8 x 6
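A minimal sketch of the agrep() call behind the spbo1 column, using base R only. The vector `spbo` is a short sample of names from table 3.3.1 plus one unrelated name added for contrast:

```r
# Sample livescore names; "Chelsea" is added only as a non-matching control.
spbo <- c("Lincoln", "Lincoln (MO)", "Lincoln Red Imps", "Lincoln United",
          "Rivadavia Lincoln", "Chelsea")

# agrep() looks for the pattern approximately *inside* each string, so any
# name that contains something close to "Lincoln" is returned.
hits <- agrep("Lincoln", spbo, value = TRUE)
hits
```

Every name containing Lincoln comes back, which matches the long spbo1 column in the table: agrep() is useful as a broad first filter but cannot by itself pick the right team.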

3.4 Apply partialMatch()

Secondly, the article Merging Data Sets Based on Partially Matched Data Elements applies subsetting to partially match the team names.

The table below displays a few of the matched team names that are not accurate.

table 3.4.2
| teamID | spboID | Match |
|---|---|---|
| AaB Aalborg | AaB Aalborg U17 | Partial |
| Airdrie United | Airdrie United Women | Partial |
| AS Trencin | AS Trencin U19 | Partial |
| Gremio Barueri | Gremio Barueri SP U20 | Partial |
| Sao Caetano | Sao Caetano Women | Partial |
| Sheffield United | Chesterfield United Women | Partial |

table 3.4.2 1306 x 3

From the table above we can see that the team AaB Aalborg from firm A is matched with AaB Aalborg U17 from the livescore website, and Airdrie United is matched to Airdrie United Women, although these are totally different teams. Such mismatches would lead a researcher to calculate wrong predictive figures for investment.
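One way to spot this kind of false match is to look at what is left of the livescore name once the firm A name is removed. The sketch below is a hypothetical check, not part of partialMatch(); the suffix list (U plus digits, Women, Reserve, II) is an assumption based on the examples mentioned in this report:

```r
# Firm A names and the livescore names they were partially matched to.
team <- c("AaB Aalborg", "Airdrie United", "Sao Caetano")
spbo <- c("AaB Aalborg U17", "Airdrie United Women", "Sao Caetano")

# Does the livescore name contain the firm A name as a literal substring?
contains <- mapply(function(t, s) grepl(t, s, fixed = TRUE),
                   team, spbo, USE.NAMES = FALSE)

# What remains after removing the firm A name from the livescore name.
leftover <- trimws(mapply(function(t, s) sub(t, "", s, fixed = TRUE),
                          team, spbo, USE.NAMES = FALSE))

# Flag matches whose leftover is a youth, women's or reserve team marker.
suspect <- contains & grepl("^(U[0-9]+|Women|Reserve|II)$", leftover)

data.frame(team, spbo, leftover, suspect)
```

Here the first two pairs are flagged as suspect and the exact pair is not. A real filter would need a longer marker list, and cases such as Gremio Barueri SP U20 show that the leftover can contain more than one token.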

In order to maximize the number of soccer matches (observations) available for the research, I separate the matching into a few steps, using split() and cross-matching to rearrange the data before starting the algorithmic matching functions in section 4 Reprocess the Data.

4. Reprocess the Data

4.1 Decision Tree

I would like to plot a hierarchical chart of how the team names are split for agrep. However, the rpart and randomForest packages require numeric data, whereas a plain diagram does not. Here I plot two dynamic graphs instead.

Since the simpleNetwork() function only accepts a two-column dataset, I split the chart into 2 graphs.

4.2 Filtering and Reprocess the Data

Prior to starting the algorithmic string matching, I use the idea of a signature() function, taken from the country names example in Merging Data Sets Based on Partially Matched Data Elements, to reduce some of the minor differences between strings. In this case: convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces. For example, United Kingdom would become kingdomunited. This reduces the string distance between equivalent names and so improves the matching result.
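The signature idea is short enough to sketch in base R. The function name `name_signature` is my own label here (chosen to avoid masking methods::signature), not the name used in the cited article:

```r
# Lower-case the name, split it into words, sort the words alphabetically
# and paste them together with no spaces.
name_signature <- function(x) {
  words <- strsplit(trimws(tolower(x)), "\\s+")
  vapply(words, function(w) paste(sort(w), collapse = ""), character(1))
}

sig <- name_signature(c("United Kingdom", "Kingdom United", "Lincoln City"))
sig
# "kingdomunited" "kingdomunited" "citylincoln"
```

Names that differ only in word order or capitalization now share the same signature, so they can be joined with an exact match before any distance calculation is needed.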

Here I split the team names into lists of words and apply grep and agrep as a first filter.
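A rough sketch of that first filter, assuming a single firm A name and a made-up vector of livescore names (`spbo`); only grep() is shown, and agrep() could be swapped in for approximate word matches:

```r
team <- "Lincoln City"
spbo <- c("Lincoln", "Lincoln United", "Manchester City", "Chelsea")

# Split the team name into its individual words.
words <- strsplit(team, " ", fixed = TRUE)[[1]]

# For each word, keep the livescore names that contain it literally.
hits <- lapply(words, function(w) grep(w, spbo, fixed = TRUE, value = TRUE))
names(hits) <- words

# The union of all hits is the reduced candidate set for the later steps.
candidates <- unique(unlist(hits))
candidates
```

The filter throws away names that share no word with the team (Chelsea here) but still keeps wrong candidates such as Manchester City, so it only narrows the search for the distance-based step that follows.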

4.3 StringDist Maximum Likelihood

There is a good example in How can I match fuzzy match strings from two datasets?, which applies expand.grid() to build a data frame of all candidate pairs and then uses Expectation Maximization theory through a while loop on stringdist().
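The sketch below shows only the general shape of that approach, under clearly stated simplifications: it is a greedy closest-pair loop rather than a real Expectation Maximization procedure, it uses base R's adist() (Levenshtein) as a stand-in for stringdist(), and the two name vectors are made-up samples:

```r
firm <- c("Lincoln City", "Aalesund", "Argentinos Juniors")
spbo <- c("Aalesund FK", "Argentinos juniors", "Lincoln")

# All candidate pairs; expand.grid() varies the first column fastest, which
# is the same order as the column-major distance matrix from adist().
pairs <- expand.grid(firm = firm, spbo = spbo, stringsAsFactors = FALSE)
pairs$dist <- as.vector(adist(firm, spbo, ignore.case = TRUE))

# Repeatedly take the closest remaining pair, then drop both names so that
# each team is matched at most once.
matched <- pairs[0, ]
while (nrow(pairs) > 0) {
  best    <- pairs[which.min(pairs$dist), ]
  matched <- rbind(matched, best)
  pairs   <- pairs[pairs$firm != best$firm & pairs$spbo != best$spbo, ]
}
matched
```

On this sample the loop pairs Argentinos Juniors first (distance 0), then Aalesund with Aalesund FK, and finally Lincoln City with Lincoln. With the full data set the pair table grows as the product of the two name lists, which is one reason to apply the filtering steps in section 4.2 first.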

From the above table, I have matched the team names used in Section 2 Dataset of Betting Strategy and Model Validation. Here I apply method = osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, soundex inside the stringdist function. Feel free to apply the function to scrape and rearrange team names and soccer score data for your own odds price modelling.

5. Appendices

5.1 Documenting File Creation

It’s useful to record some information about how your file was created.

  • File creation date: 2015-10-29
  • R version 3.2.2 (2015-08-14)
  • R version (short form): 3.2.2
  • rmarkdown package version: 0.8.1
  • File version: 1.0.2
  • File latest updated date: 2015-11-22
  • Author Profile: Ryo®, Eng Lian Hu
  • GitHub: Source Code
  • Additional session information

[1] “2015-11-22 10:38:00 JST”

setting value
version R version 3.2.2 (2015-08-14)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.1252
tz Asia/Tokyo
date 2015-11-22

sysname “Windows”
release “7 x64”
version “build 9200”
nodename “SCIBROKES”
machine “x86-64”
login “Scibrokes”
user “Scibrokes”
effective_user “Scibrokes”